Project EDIFES now has 779 cleaned building datasets ingested into HBase and around 20 working building markers. The purpose of this report is to document preliminary work towards a complete cross-sectional study of all buildings and markers.
Markers constantly change and results presented here reflect markers as of 12/1/2017.
Results were obtained via the StandardizeBuildingTypes function written by Shreyas Kamath.
| type | count |
|---|---|
| Banking | 1 |
| Educational | 74 |
| Entertainment | 10 |
| Food Sales & Service | 34 |
| Healthcare | 33 |
| Industrial | 32 |
| Office | 41 |
| Other | 25 |
| Public services | 15 |
| Retail | 106 |
| Services | 25 |
| Skyscraper | 15 |
| Storage | 7 |
| Utilities | 6 |
Completely pointless wordcloud of building types.
Results were obtained via functions I wrote for KGCZ (Köppen-Geiger climate zone, mrk-climate_identifier.R) and ASHRAE (mrk-get_ashrae_cz.R) climate zone identification from latitude and longitude. The KGCZ is looked up from latitude and longitude at 0.5 degree precision. The ASHRAE climate zone is found by querying the United States Census Bureau API to retrieve the county and matching that county against a list of counties and climate zones.
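The 0.5 degree lookup can be illustrated with a short Python sketch. This is not the code in mrk-climate_identifier.R: the function names are hypothetical, the convention that grid cells are keyed by their center is an assumption, and the single grid entry is a made-up placeholder rather than a value from the real climate grid.

```python
import math

def snap_to_cell_center(value, step=0.5):
    """Return the center of the `step`-degree grid cell that contains `value`.

    Assumption: cells are keyed by their center (e.g. 41.25, 41.75 for step=0.5).
    """
    return math.floor(value / step) * step + step / 2

def lookup_kgcz(lat, lon, grid, step=0.5):
    """Look up a climate zone code for a point; returns None if the cell is missing."""
    key = (snap_to_cell_center(lat, step), snap_to_cell_center(lon, step))
    return grid.get(key)

# Toy grid with one placeholder entry, used only to show the lookup mechanics.
toy_grid = {(41.25, -81.75): "Dfa"}
zone = lookup_kgcz(41.4, -81.6, toy_grid)
print(zone)
```

Any two buildings that fall in the same 0.5 degree cell receive the same zone, which is the practical meaning of the stated precision.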
Image to orient ourselves.
| kgcz | count |
|---|---|
| BSh | 10 |
| BSk | 14 |
| BWh | 1 |
| BWk | 1 |
| Cfa | 424 |
| Cfb | 14 |
| Csa | 17 |
| Csb | 54 |
| Dfa | 163 |
| Dfb | 78 |
Image to orient ourselves
| a_cz | count |
|---|---|
| 2A | 5 |
| 2B | 2 |
| 3A | 2 |
| 3B | 32 |
| 3C | 58 |
| 4A | 425 |
| 4C | 6 |
| 5A | 242 |
| 5B | 1 |
The data quality check currently does not work for 35 datasets (all those with 1-minute interval data).
This shows all changes with more than 5 occurrences.
Now, to highlight the datasets that were not AAAP.
| Sampleset | AAAF | AABF | AABP | AACP | AADP | BAAF | BAAP | BADP | CAAP | CADP | DAAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CWRU | 0 | 0 | 0 | 1 | 6 | 0 | 0 | 0 | 0 | 0 | 1 |
| FirstE | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
| FirstE_ex | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 18 | 0 | 0 |
| JCI2 | 1 | 1 | 5 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 2 |
| JCI2_ex | 1 | 0 | 2 | 0 | 1 | 0 | 30 | 2 | 28 | 1 | 6 |
| KSU | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Prog | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Schools | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Starbucks | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sampleset | AAAF | BAAF | BAAP | CAAF | CAAP | DAAF |
|---|---|---|---|---|---|---|
| CWRU | 1 | 0 | 0 | 0 | 0 | 1 |
| FirstE | 2 | 0 | 0 | 0 | 0 | 1 |
| FirstE_ex | 7 | 4 | 2 | 6 | 2 | 2 |
| JCI2 | 4 | 0 | 0 | 0 | 0 | 0 |
| JCI2_ex | 18 | 5 | 4 | 5 | 2 | 7 |
| KSU | 1 | 0 | 0 | 0 | 0 | 0 |
| Prog | 2 | 1 | 0 | 0 | 0 | 0 |
| Schools | 0 | 1 | 0 | 0 | 0 | 0 |
| Starbucks | 0 | 0 | 0 | 0 | 0 | 0 |
In the following map, the Starbucks locations are the stars. These demonstrate considerably different behavior from the other buildings, which might explain why they are classified as non-electrical heating despite being located in the Southwest.
The current process to determine the heating type is as follows
Example plot of heating type. Conclusion from this plot is electrical heating.
Everything to the right of the black vertical line in the following plot is classified as non-electrical heating while everything to the left is classified as electrical.
We can segment the plot to between -0.1 and 0.1 because the majority of slopes fall in that range.
The question is where to draw the line for electrical heating. Currently the cut point is at a slope of 0, but this might need to be adjusted or we need to use a different method.
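To make the effect of the cut point concrete, here is a minimal Python sketch of the classification rule. The function names and the example slopes are hypothetical, and treating a slope exactly at the cut point as non-electrical is an assumption, since the report does not say how ties are handled.

```python
def classify_heating(slope, cut_point=0.0):
    """Label a building from its fitted slope and a cut point.

    Slopes left of the cut point are electrical, slopes right of it are
    non-electrical. Assumption: a slope exactly at the cut point is
    labeled non-electrical.
    """
    return "electrical" if slope < cut_point else "non-electrical"

def share_electrical(slopes, cut_point=0.0):
    """Fraction of buildings labeled electrical for a given cut point."""
    labels = [classify_heating(s, cut_point) for s in slopes]
    return labels.count("electrical") / len(labels)

# Hypothetical slopes, only to show how moving the cut point changes the split.
example_slopes = [-0.08, -0.02, -0.001, 0.003, 0.04]
for cut in (-0.01, 0.0, 0.01):
    print(cut, share_electrical(example_slopes, cut))
```

Because most slopes sit close to 0, even a small shift in the cut point can move a noticeable share of buildings from one class to the other.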
Thoughts?
This is annual consumption for the most recent year.
| stat | |
|---|---|
| Min. | 6.118e+03 |
| 1st Qu. | 1.522e+06 |
| Median | 2.349e+06 |
| Mean | 1.570e+07 |
| 3rd Qu. | 9.013e+06 |
| Max. | 1.446e+09 |
| NA’s | 9.000e+00 |
The red vertical line is the median at 2.349e6 kWh per year.
The energy use intensity (EUI) of a building is meant to compare buildings of different sizes in the same climate zone and of the same type. The EUI is designed to normalize for the size of a building, and EUI values differ significantly across industries.
The effective thermal resistance (R-value) of a building is a measure of the building's resistance to heat loss. It can be used as a measure of the 'tightness' of a building's insulation.
As long as we have the accurate square footage and energy consumption, we should get the correct EUI value. However, the effective thermal resistance requires more complex thermodynamic analysis.
201 buildings have an energy use intensity value.
| | buts | count | mean | median | sd | reference |
|---|---|---|---|---|---|---|
| 1 | Educational | 28 | 2139.41 | 45.28 | 7933.45 | 73.10000 |
| 2 | Entertainment | 1 | 60.95 | 60.95 | NaN | 44.78750 |
| 3 | Food Sales & Service | 19 | 275.95 | 238.76 | 98.73 | 229.85000 |
| 4 | Healthcare | 10 | 330.93 | 123.85 | 465.56 | 97.92857 |
| 5 | Industrial | 20 | 1547.27 | 294.58 | 5470.15 | NaN |
| 7 | Office | 37 | 144.53 | 51.84 | 238.30 | NA |
| 8 | Public services | 2 | 78.91 | 78.91 | 4.22 | NA |
| 9 | Retail | 67 | 99.48 | 52.95 | 163.18 | 103.18333 |
| 10 | Skyscraper | 3 | 48.41 | 49.77 | 13.79 | NaN |
| 11 | Storage | 6 | 109.48 | 96.18 | 69.76 | 58.20000 |
| 12 | Utilities | 3 | 3545.07 | 2960.05 | 3557.68 | 40.69000 |
A good visualization for the distribution of EUI values is a boxplot. I filtered out EUI values greater than 500, which are highly suspect.
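The EUI calculation and the cutoff applied before plotting can be sketched in a few lines of Python. This is an illustration rather than the project's marker code: the function names are hypothetical, the example building is made up, and the result is in whatever units the inputs use, since the report does not state the units of the EUI table.

```python
def energy_use_intensity(annual_energy, floor_area_sqft):
    """EUI is annual energy consumption divided by floor area."""
    if floor_area_sqft <= 0:
        raise ValueError("floor area must be positive")
    return annual_energy / floor_area_sqft

def drop_suspect(eui_values, max_eui=500):
    """Keep only EUI values at or below the cutoff used for the boxplot."""
    return [v for v in eui_values if v <= max_eui]

# Hypothetical building: 1,000,000 energy units per year over 20,000 square feet.
example_eui = energy_use_intensity(1_000_000, 20_000)
print(example_eui)
print(drop_suspect([45.3, 238.8, 2960.1, 7933.5]))
```

An error in the recorded square footage scales the EUI directly, which is one plausible source of the very large values that the cutoff removes.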
To give a sense of context, vacuum sealed panels, the top of the line insulation, have an effective R-value of 50 hr F ft^2 / BTU.
In his thesis, Aaron cites a paper (Nordström et al. 2013[^1]) that examined R-values from 6 residential buildings in Sweden built from the 1960s to 2006 to validate the results he obtained from his function. The paper reports R-values between 9.1 and 23.7 hr F ft^2 / BTU.
[^1]: G. Nordström, H. Johnsson, and S. Lidelöw, “Using the Energy Signature Method to Estimate the Effective U-Value of Buildings,” in Sustainability in Energy and Buildings, Springer, Berlin, Heidelberg, 2013, pp. 35–44.

| buts | count | mean | median | sd |
|---|---|---|---|---|
| Educational | 28 | 387.60 | 222.95 | 514.17 |
| Entertainment | 1 | 176.54 | 176.54 | NaN |
| Food Sales & Service | 19 | 5.62 | 5.58 | 2.21 |
| Healthcare | 10 | 68.08 | 56.92 | 53.66 |
| Industrial | 20 | 43.70 | 31.06 | 61.51 |
| None | 8 | 14.90 | 3.37 | 29.45 |
| Office | 40 | 137.66 | 27.51 | 262.82 |
| Public services | 2 | 111.10 | 111.10 | 54.98 |
| Retail | 67 | 234.10 | 268.99 | 130.94 |
| Skyscraper | 3 | 264.32 | 250.00 | 252.83 |
| Storage | 6 | 117.12 | 89.58 | 80.77 |
| Utilities | 3 | 10.45 | 0.95 | 17.05 |
Boxplots of Effective Thermal Resistance. The red vertical lines indicate the typical range as reported in the paper.
This function needs some work, and I plan on addressing it over winter break. It is based on a thermodynamic model as documented by Aaron in his paper. Professor Abramson has validated the method, but the implementation might need an adjustment. Any ideas would be appreciated.
The Pearson correlation coefficient measures the strength and direction of a linear relationship between two variables. The following plots show the Pearson correlation coefficient between weather variables and energy consumption.
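For reference, the coefficient can be computed directly from its definition: the covariance of the two series divided by the product of their standard deviations. The Python sketch below uses only the standard library, and the temperature and load values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and the two root sums of squared deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cross / (spread_x * spread_y)

# Hypothetical readings: load rises linearly with temperature.
temperature = [10, 20, 30, 40]
load = [25, 45, 65, 85]
print(pearson(temperature, load))
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship, even if a non-linear one exists.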
The following heatmaps show the average correlations between weather conditions and energy consumption by climate zone. The dendrograms cluster similar weather conditions and similar climate zones.
The base to peak ratio is the average base load divided by the average peak load. This marker is segmented by winter and summer and by year so we can look at changes between the seasons as well as changes over the years.
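A minimal Python sketch of the ratio follows. The report does not define how base and peak load are extracted, so this sketch assumes the base load of a day is its minimum reading and the peak load is its maximum. The function name and the sample profiles are hypothetical.

```python
def base_to_peak_ratio(daily_profiles):
    """Average base load divided by average peak load over a set of days.

    Assumption: a day's base load is its minimum reading and its peak
    load is its maximum reading.
    """
    base_loads = [min(day) for day in daily_profiles]
    peak_loads = [max(day) for day in daily_profiles]
    average_base = sum(base_loads) / len(base_loads)
    average_peak = sum(peak_loads) / len(peak_loads)
    return average_base / average_peak

# Two hypothetical days of load readings.
days = [[20, 50, 100, 40], [30, 60, 100, 50]]
print(base_to_peak_ratio(days))
```

To reproduce the seasonal and yearly segmentation, the same function would be applied separately to the days belonging to each season and year.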
We can first look at the base to peak ratio statistics for the final year by sample set. These tables are grouped by season and arranged from lowest (best) to highest (worst) base to peak ratio.
| styp | season | count | mean | median | sd | pct_savings |
|---|---|---|---|---|---|---|
| Starbucks | winter | 19 | 0.2657895 | 0.250 | 0.0623891 | 0.2105263 |
| Schools | winter | 2 | 0.2950000 | 0.295 | 0.0212132 | 0.5000000 |
| Prog | winter | 8 | 0.4325000 | 0.390 | 0.1661110 | 0.8750000 |
| JCI2_ex | winter | 170 | 0.5208824 | 0.460 | 0.2014205 | 0.9117647 |
| FirstE_ex | winter | 30 | 0.5310000 | 0.510 | 0.2310299 | 0.8000000 |
| KSU | winter | 14 | 0.5585714 | 0.550 | 0.1037643 | 1.0000000 |
| JCI2 | winter | 358 | 0.5902235 | 0.560 | 0.1989692 | 0.9553073 |
| CWRU | winter | 37 | 0.6183784 | 0.660 | 0.1815598 | 0.9729730 |
| FirstE | winter | 137 | 0.6649635 | 0.710 | 0.2002209 | 0.9124088 |
| styp | season | count | mean | median | sd | pct_savings |
|---|---|---|---|---|---|---|
| Starbucks | summer | 19 | 0.2878947 | 0.270 | 0.0686034 | 0.3157895 |
| Prog | summer | 7 | 0.3242857 | 0.330 | 0.0877225 | 0.7142857 |
| Schools | summer | 2 | 0.3300000 | 0.330 | 0.1555635 | 0.5000000 |
| JCI2_ex | summer | 170 | 0.4711765 | 0.410 | 0.2262580 | 0.7882353 |
| KSU | summer | 14 | 0.4835714 | 0.425 | 0.1726093 | 0.9285714 |
| FirstE_ex | summer | 30 | 0.4923333 | 0.425 | 0.2430791 | 0.7666667 |
| JCI2 | summer | 358 | 0.5433799 | 0.505 | 0.2196496 | 0.8715084 |
| CWRU | summer | 37 | 0.5929730 | 0.620 | 0.2086713 | 0.8918919 |
| FirstE | summer | 137 | 0.6018248 | 0.630 | 0.2133770 | 0.8978102 |
We can also look at boxplots for each sampleset. The blue vertical line indicates the threshold established for savings opportunities.
As a sanity check, we can look at a plot showing the relationship between the ratio during the summer and winter. We would expect this to be a positive linear relationship.
The base to peak ratio is calculated for each year, so we can look at the changes over the years to see which buildings are improving.
Most buildings change relatively little, if at all, over the years. Again, we can compare seasons to see if there is a correlation between the summer change and the winter change in base to peak ratio.
Ideally, a building would be in the upper right quadrant, with positive changes in both summer and winter.
| styp | season | imp_pct |
|---|---|---|
| CWRU | summer | 0.2500000 |
| CWRU | winter | 0.3888889 |
| FirstE | summer | 0.4104478 |
| FirstE | winter | 0.3955224 |
| FirstE_ex | summer | 0.5333333 |
| FirstE_ex | winter | 0.5000000 |
| JCI2 | summer | 0.3867069 |
| JCI2 | winter | 0.3867069 |
| JCI2_ex | summer | 0.4720497 |
| JCI2_ex | winter | 0.4720497 |
| KSU | summer | 0.4615385 |
| KSU | winter | 0.0769231 |
| Prog | summer | 0.6000000 |
| Prog | winter | 0.8000000 |
| Schools | summer | 0.5000000 |
| Schools | winter | 0.5000000 |
| Starbucks | summer | 0.7333333 |
| Starbucks | winter | 0.6000000 |
The HVAC schedule function finds the most likely turn on and turn off times for business and non-business days.
First, we can look at business day turn on and turn off times by sample set.
One other thing to look at is typical length of operating day.
| styp | mean_on | mean_off | hours |
|---|---|---|---|
| CWRU | 7.528 | 14.847 | 7.319 |
| FirstE | 7.897 | 18.222 | 10.325 |
| FirstE_ex | 5.800 | 18.683 | 12.883 |
| JCI2 | 8.914 | 19.442 | 10.527 |
| JCI2_ex | 8.605 | 19.404 | 10.799 |
| KSU | 5.911 | 16.714 | 10.804 |
| Prog | 5.250 | 18.344 | 13.094 |
| Schools | 4.625 | 16.500 | 11.875 |
| Starbucks | 3.882 | 21.763 | 17.882 |
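The length of the operating day follows from the turn on and turn off times. The Python sketch below assumes these times are decimal hours of the day, which is consistent with the hours column equaling mean_off minus mean_on up to rounding. The function names are hypothetical, and the handling of schedules that run past midnight is an assumption.

```python
def operating_hours(turn_on, turn_off):
    """Hours between turn on and turn off, both given as decimal hours.

    Assumption: a turn off time earlier than the turn on time means the
    schedule wraps past midnight.
    """
    return (turn_off - turn_on) % 24

def to_clock(decimal_hour):
    """Format a decimal hour such as 7.5 as an HH:MM string."""
    hour = int(decimal_hour)
    minute = round((decimal_hour - hour) * 60)
    if minute == 60:  # rounding can push the minutes up to the next hour
        hour, minute = hour + 1, 0
    return f"{hour % 24:02d}:{minute:02d}"

# CWRU row from the table above: mean_on = 7.528, mean_off = 14.847.
print(to_clock(7.528), to_clock(14.847), round(operating_hours(7.528, 14.847), 3))
```

Converting to clock times makes the table easier to read at a glance. For example, a mean turn on time of 7.5 corresponds to 07:30.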
We can make some correlation plots to determine relationships that exist between building markers. The quantitative numbers can also be printed to look at the possible trends.
| | eui | r | summer_ratio | winter_ratio | log_annc | hours |
|---|---|---|---|---|---|---|
| eui | 1.0000000 | -0.4139331 | 0.1992554 | 0.0773265 | 0.0843433 | 0.0180491 |
| r | -0.4139331 | 1.0000000 | -0.2083278 | -0.2105101 | -0.0945704 | -0.0584415 |
| summer_ratio | 0.1992554 | -0.2083278 | 1.0000000 | 0.8324141 | 0.4419049 | -0.3585436 |
| winter_ratio | 0.0773265 | -0.2105101 | 0.8324141 | 1.0000000 | 0.5862920 | -0.3625331 |
| log_annc | 0.0843433 | -0.0945704 | 0.4419049 | 0.5862920 | 1.0000000 | -0.2606391 |
| hours | 0.0180491 | -0.0584415 | -0.3585436 | -0.3625331 | -0.2606391 | 1.0000000 |
Another good option is to make pairwise plots. The diagonals show the distribution of the variable, and in the second plot, the asterisks indicate the significance of the relationship.
In order to meet ARPA-E milestone 4.1.1, we need to develop a predictive model that achieves an adjusted R² greater than 0.85 when predicting six months of electricity consumption. I wanted to test a random forest regression model for predictive capability. The details of the random forest are presented below, but in summary, the random forest is an extremely powerful model that maintains a level of interpretability.
The original paper describing Random Forests is by Leo Breiman.
To understand the random forest, you first need to grasp the concept of a decision tree. The best way to describe a single decision tree is as a flowchart of questions about the variable values of an observation that leads to a classification or prediction. Each question (known as a node) has a yes/no answer based on the value of a particular variable. The two answers form branches leading away from the node. Eventually, the tree terminates in a final classification or prediction node called a leaf. A single decision tree can be arbitrarily large and deep depending on the number of features and the number of classes. Decision trees are adept at both classification and regression and can learn a non-linear decision boundary (they actually learn many small linear decision boundaries which collectively are non-linear). However, a single decision tree is very prone to overfitting, especially as the depth increases: the tree is flexible enough that it tends to simply memorize the training data.

To solve this problem, ensembles of decision trees are combined into a powerful model known as a random forest. Each tree in the forest is trained on a randomly chosen subset of the training data (either with replacement, called bootstrapping, or without) and on a subset of the features (in Breiman's formulation, a random subset of features is considered at each split). This increases variability between trees, making the overall forest more robust and less prone to overfitting. To make predictions, the random forest passes the features (values of variables) of the observation to all trees and combines their outputs: an average of the tree predictions for regression, or a vote for classification. Training on bootstrap samples and aggregating the results this way is known as bagging. Some implementations also weight the votes of each tree by the confidence the tree has in its prediction.

Overall, the random forest is fast, relatively simple, has a moderate level of interpretability, and performs extremely well on both classification and regression tasks.
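The bootstrap and averaging steps can be shown with a deliberately small Python sketch. It is not a full random forest and not the model used in this report: each tree is a one-split regression tree (a stump) on a single feature, so there is no feature subsampling and no depth to tune. All names and the toy data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree: pick the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    """Answer the stump's single yes/no question and return the leaf value."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_forest(xs, ys, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in picks], [ys[i] for i in picks]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all trees (regression)."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)

# Toy data: consumption jumps once the feature passes 4.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [10, 10, 10, 10, 50, 50, 50, 50]
toy_forest = fit_forest(toy_x, toy_y)
print(predict_forest(toy_forest, 1), predict_forest(toy_forest, 8))
```

Because each stump sees a different bootstrap sample, the individual thresholds vary slightly, and the average is smoother than any single tree.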
The random forest should be one of the first models tried on any machine learning problem and is generally my second approach after a linear model. There are a number of hyperparameters that must be specified for the forest ahead of time, the most important being the number of trees in the forest, the number of features considered by each tree, the depth of the tree, and the minimum number of observations permitted at each leaf of the tree. These can be selected by training many different models with varying hyperparameters and selecting the combination that performs best on cross-validation or a testing set. A random forest performs implicit feature selection and can return the relative importances of the features, so it can be used as a method to reduce dimensions for additional algorithms.
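The selection procedure amounts to a grid search, sketched below in Python. The parameter names and the scoring function are hypothetical stand-ins: in practice the scoring function would train a forest with the given settings and return its cross-validation or test-set score.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in `param_grid` and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid over two of the hyperparameters named above.
example_grid = {"n_trees": [10, 100], "max_depth": [2, 4, 8]}

def toy_score(params):
    """Stand-in for a validation score; it simply peaks at n_trees=100, max_depth=4."""
    return -abs(params["n_trees"] - 100) - abs(params["max_depth"] - 4)

print(grid_search(example_grid, toy_score))
```

The number of models trained is the product of the list lengths, so the grid grows quickly as more hyperparameters are added.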
A simplified model of a decision tree used for exactly this task is presented below.
In order to test the accuracy of the method, I trained the model on all data except for the final six months. I then took the final six months of data and made predictions for the electricity consumption. These predictions were compared to the known true values to assess the predictive capabilities of the random forest. This procedure was then completed for all buildings in HBase.
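The hold-out procedure and its scoring can be sketched in Python as follows. The function names and the sample values are hypothetical, and the exact way the adjusted R² was computed for this report is not stated. The sketch uses the standard definitions, where `n_obs` is the number of test observations and `n_features` is the number of predictors.

```python
def split_final_window(records, holdout):
    """Split time-ordered records so the last `holdout` points form the test set."""
    if not 0 < holdout < len(records):
        raise ValueError("holdout must be between 1 and len(records) - 1")
    return records[:-holdout], records[-holdout:]

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual / total

def adjusted_r_squared(r2, n_obs, n_features):
    """Penalize R squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Hypothetical series: train on the first six points, hold out the last four.
series = [9, 11, 10, 12, 11, 13, 10, 12, 14, 16]
train, test = split_final_window(series, holdout=4)
predictions = [11, 12, 13, 16]  # stand-in for model output on the hold-out window
score = r_squared(test, predictions)
print(score, adjusted_r_squared(score, n_obs=len(test), n_features=1))
```

Splitting by time rather than at random matters here: the model never sees data from the period it is asked to predict, which mirrors how it would be used in practice.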
The prediction capabilities of the model have been tested against all buildings in HBase with fewer than 1 million datapoints. The average runtime to train and predict for a building is around 4 minutes (depending on number of datapoints).
The following are predictions made for the Progressive APS building in Phoenix, Arizona. The R² value for these predictions was 0.933.
Animated graphs are a good way to highlight changes over time and can compress a considerable amount of information into a single visual. I am not sure if these will be useful, but at the least they are fun to make!